chore: router ci improvement by miklosbarabas · Pull Request #2234 · wundergraph/cosmo

miklosbarabas · 2025-09-22T08:29:39Z

CI now performs Redis ACL setup and health checks inside the running container rather than on the host.
Host-level cluster setup and host-installed redis-tools are no longer used, reducing runner dependencies.
Standardized container-scoped port usage and dynamic container selection to improve CI stability and reduce flakiness.
No user-facing changes; improves build reliability and contributor experience.
Removed unnecessary unofficial GH Actions to reduce supply-chain attack surface

Summary by CodeRabbit

CI
- Integration tests now provision a Redis cluster with health checks, automatic readiness waiting, dynamic ACL/config application, and simplified test invocation.
Tests
- Added configurable retries for failed tests.
- Flaky event-stream and batch tests moved into dedicated flaky suites.
- A specific rate-limit cluster-mode test was converted to a non-executing placeholder pending follow-up.
Chores
- Upgraded test-runner tooling to a newer version.

Checklist

I have discussed my proposed changes in an issue and have received approval to proceed.
I have followed the coding standards of the project.
Tests or benchmarks have been added or updated.
Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
I have read the Contributors Guide.

- improved redis usage from the ci - renamed TestNatsEvents to TestFlakyNatsEvents to be retried on failure - commented out suspected faulty test

coderabbitai · 2025-09-22T08:29:49Z

Walkthrough

CI adds three Redis cluster services, a readiness/initialization step, and dynamic ACL/config application. Several flaky tests are moved into new top-level test functions or commented out. gotestsum invocation and version updated, and test rerun support added to Makefiles.

Changes

Cohort / File(s)	Summary of changes
CI workflow & Redis cluster orchestration `.github/workflows/router-ci.yaml`	Adds public services `redis-0`, `redis-1`, `redis-2` (image `bitnamilegacy/redis-cluster:7.2`) with env, healthchecks, ports; removes docker login/credentials under `redis`; adds “Wait for Redis Cluster” readiness/init step; replaces fixed-port ACL loop with container discovery and ACL application; removes prior dedicated cluster setup step; simplifies integration test invocation.
NATS flaky test extraction `router-tests/events/nats_events_test.go`	Removes the embedded “subscribe ws with filter” block from `TestNatsEvents` and adds a new top-level `func TestFlakyNatsEvents(t *testing.T)` containing that logic.
Batch flaky test extraction `router-tests/batch_test.go`	Moves/duplicates the inlined Batch Tracing assertions into a new top-level `func TestFlakyBatch(t *testing.T)` and removes the inlined nested test.
Rate limit test change `router-tests/ratelimit_test.go`	Removes a specific Cluster Mode test case from the active matrix and leaves it commented with a TODO (test converted to a non-executed placeholder).
Test tooling / Makefile updates `router-tests/Makefile`, `Makefile`	Adds `test_retry_count ?= 0`; updates `gotestsum` invocation in `router-tests/Makefile` to include `--rerun-fails="$(test_retry_count)"`, `--packages="$(test_target)"`, and `-f $(FORMAT)`; bumps `gotest.tools/gotestsum` version from `v1.12.3` to `v1.13.0` in `Makefile` setup target.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat(router): improved heartbeats for subscriptions #2141 — Updates in router-tests/events/nats_events_test.go overlap (both modify/flank flaky NATS event tests), likely related to test reorganization.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly identifies the PR as non-user-facing (“chore”) and clearly indicates that the primary change is improving the router CI pipeline, matching the overall objectives described in the PR without extraneous detail.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-09-22T08:31:47Z

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-7101448bdbf338d4891559945882db7d3ca43776

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/router-ci.yaml (1)
321-331: Regex excludes many legitimate tests; tighten to “not starting with TestFlaky”.

^Test[^(Flaky)] will skip any test whose first char after “Test” is F/l/a/k/y/“(”. Replace with an alternation that matches everything except TestFlaky….
-        run: make test test_params="-run '^Test[^(Flaky)]' --timeout=5m -p 1 --parallel 10" test_target="${{ matrix.test_target }}"
+        run: make test test_params="-run '^(Test[^F].*|TestF[^l].*|TestFl[^a].*|TestFla[^k].*|TestFlak[^y].*)$' --timeout=5m -p 1 --parallel 10" test_target="${{ matrix.test_target }}"
If you’d prefer simpler ops, an alternative is to keep -run '^Test' here and ensure all flaky suites are skipped via t.Skip behind an env flag (e.g., SKIP_FLAKY=1) and run them explicitly in the flaky step with that flag unset.

🧹 Nitpick comments (3)

router-tests/ratelimit_test.go (1)

703-707: Don’t comment out tests; skip with rationale and TODO.

Convert the commented case into a t.Run that calls t.Skipf(...) explaining the cluster default-user password caveat. Keeps the test discoverable and documents intent.

-				// Seems like not working as expected ? (if password is configured for the cluster default user)
-				//{
-				//	name:            "should successfully use auth from later url if no auth in first urls",
-				//	clusterUrlSlice: []string{"redis://localhost:7003", "rediss://localhost:7001", "rediss://cosmo:test@localhost:7002"},
-				//},
+				{
+					name:            "should successfully use auth from later url if no auth in first urls (skipped)",
+					clusterUrlSlice: []string{"redis://localhost:7003", "rediss://localhost:7001", "rediss://cosmo:test@localhost:7002"},
+				},

And in the loop:

-			for _, tt := range tests {
+			for _, tt := range tests {
 				t.Run(tt.name, func(t *testing.T) {
 					t.Parallel()
+					if strings.Contains(tt.name, "(skipped)") {
+						t.Skipf("Disabled under CI when cluster default user requires auth; follow-up needed to assert client URL auth merge semantics independently of server auth policy.")
+					}

.github/workflows/router-ci.yaml (2)

192-233: Redis Cluster services: good coverage; consider pinning image digests.

The 3-node cluster setup looks solid. To reduce supply-chain drift in CI, pin images to immutable digests (e.g., bitnamilegacy/redis-cluster:7.2@sha256:...).

309-319: Inline credentials in CI script (CKV_SECRET_4).

Even though this is test infra, embedding cosmo:test in URIs triggers secret scanners and isn’t future‑proof. Use env vars in the step and reference them.

-      - name: Configure Redis Authentication & ACL
-        run: |
+      - name: Configure Redis Authentication & ACL
+        env:
+          REDIS_TEST_USER: cosmo
+          REDIS_TEST_PASSWORD: test
+        run: |
           docker ps -a
           # Set a password for each master node
-          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
+          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
             echo "Configuring ACLs in container $cid"
-            docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
+            docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$REDIS_TEST_USER" on ">$REDIS_TEST_PASSWORD" "~*" "+@all"
             docker exec "$cid" redis-cli -p 6379 ping
           done
-          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
+          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
           # Sanity checks
           docker exec "$cid" redis-cli -p 6379 ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+          docker exec "$cid" redis-cli -u "redis://${REDIS_TEST_USER}:${REDIS_TEST_PASSWORD}@redis-0:6379" ping
+          docker exec "$cid" redis-cli -u "redis://${REDIS_TEST_USER}:${REDIS_TEST_PASSWORD}@redis-0:6379" cluster nodes

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae65619 and dac721e.

📒 Files selected for processing (3)

.github/workflows/router-ci.yaml (3 hunks)
router-tests/events/nats_events_test.go (1 hunks)
router-tests/ratelimit_test.go (1 hunks)

🧰 Additional context used

🪛 Checkov (3.2.334)

.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: build-router
GitHub Check: build_push_image
GitHub Check: build_push_image (nonroot)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: image_scan (nonroot)
GitHub Check: image_scan
GitHub Check: integration_test (./events)
GitHub Check: build_test
GitHub Check: integration_test (./telemetry)
GitHub Check: build_test
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (go)

🔇 Additional comments (1)

router-tests/events/nats_events_test.go (1)

65-66: Rename to TestFlaky… is fine; verify CI selection regex won’t over-exclude.

The rename cleanly gates this suite behind “flaky” runs. However, the CI regex used to exclude flakies (^Test[^(Flaky)]) will also skip any tests starting with TestF…, Testl…, Testa…, etc. Please update the workflow regex as suggested in router-ci.yaml to avoid unintentionally dropping non‑flaky tests.

- using gotestsum to retry flaky tests

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

router-tests/Makefile (1)

4-4: Sanity‑check --rerun-fails behavior in your CI version of gotestsum.

Recent reports show --rerun-fails can sometimes yield a zero exit code despite failures, depending on version/output parsing. If you see anomalies, disable it or add --rerun-fails-report to surface retries. (github.com)

If helpful, I can add a CI‑guard to toggle reruns (e.g., RERUN_FLAKY?=1 and $(if $(RERUN_FLAKY),--rerun-fails=$(rerun_fails_count))).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dac721e and cfb3597.

📒 Files selected for processing (1)

router-tests/Makefile (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: Analyze (go)

- using gotestsum to retry flaky tests

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/router-ci.yaml (1)
159-161: Runner label typo blocks the job from scheduling.

Use ubuntu-latest; ubuntu-latest-l is invalid.
-    runs-on: ubuntu-latest-l
+    runs-on: ubuntu-latest

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfb3597 and 61f6f77.

📒 Files selected for processing (2)

.github/workflows/router-ci.yaml (3 hunks)
router-tests/Makefile (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

router-tests/Makefile

🧰 Additional context used

🪛 Checkov (3.2.334)

.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: build-router
GitHub Check: integration_test (./telemetry)
GitHub Check: build_test
GitHub Check: build_push_image
GitHub Check: integration_test (./events)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: build_push_image (nonroot)
GitHub Check: image_scan
GitHub Check: image_scan (nonroot)
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (go)

🔇 Additional comments (2)

.github/workflows/router-ci.yaml (2)

256-294: Harden the “Wait for Redis Cluster” probe (strict mode, fail-fast, pick the creator by intent).

Same as the earlier suggestion; current script may silently pass/fail and pick an arbitrary node.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          # Prefer the intended creator (REDIS_CLUSTER_CREATOR=yes) via inspect; fallback to container named *redis-2*; else first
+          creator=""
+          while IFS= read -r pair; do
+            cid="${pair%%|*}"; name="${pair##*|}"
+            if docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' "$cid" | grep -q '^REDIS_CLUSTER_CREATOR=yes$'; then creator="$cid"; fi
+            [ -n "${first:-}" ] || first="$cid"
+            if [ -z "$creator" ] && echo "$name" | grep -q 'redis-2'; then creator="$cid"; fi
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

192-233: Keep port publishing — tests use localhost:7001/7002/7003

Search confirms tests reference the exposed host ports; do not remove port publishing.

Locations:

router-tests/testenv/testenv.go:399 — redisClusterHost = "redis://localhost:7001"
router-tests/ratelimit_test.go:686, 701, 710, 755 — clusterUrlSlice references localhost:7001/7002/7003
router-tests/automatic_persisted_queries_test.go:414 — clusterUrls include localhost:7001/7002

- using gotestsum to retry flaky tests

- using gotestsum to retry flaky tests - removed non-official custom gh actions to reduce supply-chain attack vector

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

.github/workflows/router-ci.yaml (1)

192-233: Redis cluster services: good direction; pin images and prefer deterministic creator selection.

Pin images by digest for determinism/supply-chain.

Minor: since only redis-2 is the creator, add a comment to keep this invariant or derive creator in the wait script (preferred).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c635a4 and d105701.

📒 Files selected for processing (1)

.github/workflows/router-ci.yaml (3 hunks)

🧰 Additional context used

🪛 Checkov (3.2.334)

.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: build-router
GitHub Check: build_push_image
GitHub Check: image_scan (nonroot)
GitHub Check: image_scan
GitHub Check: build_push_image (nonroot)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: integration_test (./events)
GitHub Check: build_test
GitHub Check: Analyze (go)

🔇 Additional comments (3)

.github/workflows/router-ci.yaml (3)

256-294: Harden the wait probe; fail-fast, pick the intended creator, keep diagnostics.

Same concern as the prior review: add strict shell, verify containers found, prefer the creator node, and preserve logs on failure.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          creator=""
+          first=""
+          while IFS= read -r line; do
+            cid="${line%%|*}"; name="${line##*|}"
+            [ -n "${first}" ] || first="$cid"
+            if echo "$name" | grep -q 'redis-2'; then creator="$cid"; fi
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          timeout 5s docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if timeout 3s docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

309-318: Don’t embed credentials; use env/secrets and redis-cli --user/-a. Fix CKV_SECRET_4.

This leaks creds into logs and triggers Checkov. Use masked env and avoid URIs.

-          # Set a password for each master node
-          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
-            echo "Configuring ACLs in container $cid"
-            docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
-            docker exec "$cid" redis-cli -p 6379 ping
-          done
-          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
-          # Sanity checks
-          docker exec "$cid" redis-cli -p 6379 ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+          # Configure ACLs (secrets masked by Actions)
+          ACL_USER="${ACL_USER:-cosmo}"
+          ACL_PASS="${CI_REDIS_PASS:-test}"
+          echo "::add-mask::$ACL_PASS"
+          for cid in $(docker ps --format '{{.ID}} {{.Image}}' | awk '$2 ~ /redis-cluster/ {print $1}'); do
+            echo "Configuring ACLs in container $cid"
+            docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$ACL_USER" on ">$ACL_PASS" "~*" "+@all"
+            docker exec "$cid" redis-cli -p 6379 ping >/dev/null
+          done
+          cid=$(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="redis:7"{print $1}' | head -n1)
+          # Sanity checks (no URI with creds in logs)
+          docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
+          docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" cluster nodes

Optionally add to the step:
env:
CI_REDIS_PASS: ${{ secrets.CI_REDIS_PASS }}
ACL_USER: cosmo

321-321: Retries confirmed — no change needed

router-tests/Makefile line 15 invokes gotestsum with --rerun-fails="$(test_retry_count)", so the workflow's test invocation will trigger retries.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

.github/workflows/router-ci.yaml (1)

192-233: Redis Cluster services: consider avoiding host port exposure and pin image by digest.

You can rely on service DNS (redis-0/1/2:6379) from the job container and drop host port mappings (7001-7003) to reduce collision risk.

Pin bitnamilegacy/redis-cluster:7.2 by digest to harden supply chain and avoid accidental tag drift.

If DockerHub rate limiting is a concern, consider mirroring to GHCR or using an org-managed cache.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d105701 and 13abcb8.

📒 Files selected for processing (1)

.github/workflows/router-ci.yaml (3 hunks)

🧰 Additional context used

🪛 Checkov (3.2.334)

.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: build-router
GitHub Check: build_test
GitHub Check: integration_test (./events)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: integration_test (./telemetry)
GitHub Check: build_push_image (nonroot)
GitHub Check: image_scan
GitHub Check: image_scan (nonroot)
GitHub Check: build_test
GitHub Check: build_push_image
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (go)

🔇 Additional comments (5)

.github/workflows/router-ci.yaml (5)

256-294: Harden the Redis Cluster wait/probe; fail fast and prefer the creator node.

Add strict shell options, error out when no containers found, and prefer the intended creator (redis-2). Keep diagnostics on failure.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          # Prefer the creator node if present; otherwise pick the first
+          creator=""
+          while IFS= read -r line; do
+            cid="${line%%|*}"; name="${line##*|}"
+            if echo "$name" | grep -q "redis-2"; then creator="$cid"; fi
+            [ -n "${first:-}" ] || first="$cid"
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

306-318: Don’t embed credentials; pass via env and use redis-cli --user/-a.

Avoid leaking creds into logs and satisfy CKV_SECRET_4.

       # Set a password for each master node
-      for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
-        echo "Configuring ACLs in container $cid"
-        docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
-        docker exec "$cid" redis-cli -p 6379 ping
+      ACL_USER="${ACL_USER:-cosmo}"
+      ACL_PASS="${ACL_PASS:-test}"
+      export ACL_USER ACL_PASS
+      set +x
+      for cid in $(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="bitnamilegacy/redis-cluster:7.2"{print $1}'); do
+        echo "Configuring ACLs in container $cid"
+        docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$ACL_USER" on ">$ACL_PASS" "~*" "+@all"
+        docker exec "$cid" redis-cli -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
       done
-      cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
+      cid=$(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="redis:7"{print $1}' | head -n1)
       # Sanity checks
-      docker exec "$cid" redis-cli -p 6379 ping
-      docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-      docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+      docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
+      docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" cluster nodes

Note: if you can use repo secrets, prefer setting ACL_USER/ACL_PASS via secrets and omit the default.

192-233: Claim vs implementation: host ports mapped contradict “container-scoped” usage.

PR summary says “container-scoped port usage,” but 7001-7003 are host-published here.

Can tests resolve redis-0/1/2 by service DNS and drop the port publishes?

159-159: Verify runner label “ubuntu-latest-l”.

Is this a custom self-hosted label? If not, GitHub’s standard label is “ubuntu-latest”.

321-321: LGTM on test invocation.

The tighter timeout and explicit parallelism look fine.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

router-tests/ratelimit_test.go (1)

703-707: Consider opening a tracking issue for the commented-out test.

The TODO note explains that the test was a false positive and requires investigation. To ensure this task is not forgotten, consider opening a tracking issue (if one doesn't already exist) to investigate and fix the underlying functionality.

Do you want me to open a new issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53e7e88 and e27cc5c.

📒 Files selected for processing (1)

router-tests/ratelimit_test.go (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: build-router
GitHub Check: Analyze (go)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: build_push_image
GitHub Check: image_scan
GitHub Check: build_push_image (nonroot)
GitHub Check: image_scan (nonroot)
GitHub Check: integration_test (./events)
GitHub Check: Analyze (javascript-typescript)
GitHub Check: build_test

endigma

LGTM

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

router-tests/events/nats_events_test.go (2)
1614-1724: Consider documenting the flakiness root cause.

The test implementation looks correct and comprehensive. However, isolating flaky tests without addressing the root cause can lead to accumulating technical debt, as noted in past review comments.

Consider adding a comment explaining:

Why this test is flaky (e.g., timing-sensitive SSE event delivery, NATS message ordering)

Whether the flakiness is environmental or indicates a potential product issue

A reference to track investigation of the root cause (e.g., issue number)

This documentation will help future contributors understand whether the flakiness is acceptable or needs fixing.

Example:
 func TestFlakyNatsEvents(t *testing.T) {
 	t.Parallel()
+	// TODO(ENG-XXXX): This test is flaky due to timing sensitivity in SSE event delivery
+	// when multiple events are published rapidly. The test occasionally misses events
+	// or receives them out of order. This is likely a test infrastructure issue rather
+	// than a product bug, but should be investigated.
 
 	t.Run("subscribe sse with filter", func(t *testing.T) {
1614-1616: Establish pattern for additional flaky tests.

Good separation of flaky tests into a dedicated top-level function. If additional tests need to be marked as flaky in the future, consider adding them as sub-tests within TestFlakyNatsEvents rather than creating multiple TestFlaky* functions. This keeps all flaky tests in one place and makes CI configuration simpler.

Based on past review comments, which mentioned potential flakiness in Kafka and Redis tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e27cc5c and 6ba2c3c.

📒 Files selected for processing (2)

router-tests/events/nats_events_test.go (1 hunks)
router-tests/ratelimit_test.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

router-tests/ratelimit_test.go

🧰 Additional context used

🧬 Code graph analysis (1)

router-tests/events/nats_events_test.go (2)

router-tests/testenv/testenv.go (4)

Run (107-124)

Config (286-343)

ConfigWithEdfsNatsJSONTemplate (87-87)

Environment (1731-1767)

router-tests/testenv/utils.go (1)

AwaitChannelWithT (10-19)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: build-router
GitHub Check: build_push_image
GitHub Check: build_push_image (nonroot)
GitHub Check: image_scan (nonroot)
GitHub Check: image_scan
GitHub Check: integration_test (./events)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: build_test
GitHub Check: Analyze (go)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (1)

router-tests/events/nats_events_test.go (1)

1614-1724: Confirm removal of "subscribe sse with filter" from TestNatsEvents; only one occurrence remains in TestFlakyNatsEvents.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

router-tests/batch_test.go (1)

750-803: Clarify test duplication and flaky subset.

This new TestFlakyBatch function appears to duplicate logic from the existing "Verify all root span attributes for batch requests" test in TestBatch (lines 565-631), but with significantly reduced attribute coverage (7 attributes vs 24).

Given the past review discussion about isolating specific flaky tests, please clarify:

Is this the minimal flaky subset, or are you still determining which specific assertions are flaky?

Why were these particular 7 attributes selected for the flaky test?

Should the original comprehensive test in TestBatch remain, or will it be removed once flakiness is resolved?

The reduced coverage could mask issues with the untested attributes, so understanding the rationale will help ensure test quality.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34f2aa8 and 4cfadd8.

📒 Files selected for processing (1)

router-tests/batch_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

router-tests/batch_test.go (4)

router-tests/testenv/testenv.go (4)

Run (107-124)

Config (286-343)

Environment (1731-1767)

GraphQLRequest (1907-1915)

router/pkg/trace/tracetest/tracetest.go (1)

NewInMemoryExporter (11-17)

router/pkg/config/config.go (2)

Config (998-1072)

BatchingConfig (879-884)

router/pkg/otel/attributes.go (7)

WgRouterConfigVersion (21-21)

WgRouterRootSpan (32-32)

WgIsBatchingOperation (49-49)

WgBatchingOperationsCount (50-50)

WgOperationHash (14-14)

WgClientName (18-18)

WgClientVersion (19-19)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: build-router
GitHub Check: image_scan
GitHub Check: build_push_image (nonroot)
GitHub Check: build_push_image
GitHub Check: image_scan (nonroot)
GitHub Check: integration_test (./telemetry)
GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
GitHub Check: build_test
GitHub Check: integration_test (./events)
GitHub Check: Analyze (go)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (1)

router-tests/batch_test.go (1)

791-791: Dynamic config version check differs from original test.

Line 791 uses xEnv.RouterConfigVersionMain() for dynamic version checking, while the original test at line 610 uses sa.HasValue(otel.WgRouterConfigVersion) to check for attribute existence.

This difference could explain flakiness if the config version value changes across test runs. However, the original comprehensive test also passes, suggesting this might not be the root cause.

Consider verifying whether the config version value is stable in your CI environment.

chore: router ci improvement

dac721e

- improved redis usage from the ci - renamed TestNatsEvents to TestFlakyNatsEvents to be retried on failure - commented out suspected faulty test

miklosbarabas requested review from a team, Noroth, StarpTech, devsergiy, endigma and jensneuse as code owners September 22, 2025 08:29

miklosbarabas requested a review from thisisnithin September 22, 2025 08:29

miklosbarabas mentioned this pull request Sep 22, 2025

chore: router ci improvement #2229

Closed

5 tasks

coderabbitai Bot reviewed Sep 22, 2025

View reviewed changes

Comment thread .github/workflows/router-ci.yaml

miklosbarabas and others added 2 commits September 22, 2025 11:06

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

f38b748

chore: router ci improvement

cfb3597

- using gotestsum to retry flaky tests

coderabbitai Bot reviewed Sep 22, 2025

View reviewed changes

Comment thread router-tests/Makefile Outdated

Comment thread router-tests/Makefile Outdated

chore: router ci improvement

61f6f77

- using gotestsum to retry flaky tests

coderabbitai Bot reviewed Sep 22, 2025

View reviewed changes

Comment thread .github/workflows/router-ci.yaml

chore: router ci improvement

2c635a4

- using gotestsum to retry flaky tests

github-actions Bot added the monorepo label Sep 22, 2025

chore: router ci improvement

d105701

- using gotestsum to retry flaky tests - removed non-official custom gh actions to reduce supply-chain attack vector

coderabbitai Bot reviewed Sep 22, 2025

View reviewed changes

endigma requested changes Sep 23, 2025

View reviewed changes

Comment thread .github/workflows/router-ci.yaml

Comment thread .github/workflows/router-ci.yaml

Comment thread router-tests/Makefile

chore: router ci improvement

13abcb8

coderabbitai Bot reviewed Sep 23, 2025

View reviewed changes

miklosbarabas and others added 5 commits September 25, 2025 01:50

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

dd83f07

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

3b9e5fe

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

83aedde

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

5c732b8

chore: router ci improvement

1bfdc73

miklosbarabas added 2 commits October 2, 2025 23:38

chore: router ci improvement

9838dcc

chore: router ci improvement

53e7e88

miklosbarabas requested a review from endigma October 2, 2025 22:24

chore: router ci improvement

e27cc5c

coderabbitai Bot reviewed Oct 2, 2025

View reviewed changes

miklosbarabas enabled auto-merge (squash) October 3, 2025 06:15

SkArchon reviewed Oct 3, 2025

View reviewed changes

Comment thread router-tests/events/nats_events_test.go Outdated

ysmolski approved these changes Oct 3, 2025

View reviewed changes

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

f0a1269

endigma approved these changes Oct 3, 2025

View reviewed changes

alepane21 requested changes Oct 3, 2025

View reviewed changes

Comment thread router-tests/batch_test.go Outdated

Comment thread router-tests/events/nats_events_test.go Outdated

miklosbarabas and others added 2 commits October 3, 2025 12:41

Merge branch 'main' into miklos/eng-8201-chore-router-ci-improvements

ec6400b

chore: router ci improvement

6ba2c3c

coderabbitai Bot reviewed Oct 3, 2025

View reviewed changes

miklosbarabas added 2 commits October 3, 2025 13:02

chore: router ci improvement

34f2aa8

chore: router ci improvement

4cfadd8

coderabbitai Bot reviewed Oct 3, 2025

View reviewed changes

miklosbarabas requested a review from alepane21 October 3, 2025 11:32

alepane21 approved these changes Oct 3, 2025

View reviewed changes

miklosbarabas merged commit fea4d1d into main Oct 3, 2025
29 checks passed

miklosbarabas deleted the miklos/eng-8201-chore-router-ci-improvements branch October 3, 2025 12:21

coderabbitai Bot mentioned this pull request Nov 25, 2025

feat: initial codecov changes #2360

Merged

5 tasks

coderabbitai Bot mentioned this pull request Jan 14, 2026

fix: remove wrong test case #2453

Merged

5 tasks

coderabbitai Bot mentioned this pull request Mar 3, 2026

fix(ci): improve reliability of the router-tests #2577

Merged

3 tasks

Conversation

miklosbarabas commented Sep 22, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Checklist

Uh oh!

coderabbitai Bot commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Pre-merge checks and finishing touches

Uh oh!

github-actions Bot commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Router image scan passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

endigma left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

miklosbarabas commented Sep 22, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 22, 2025 •

edited

Loading

github-actions Bot commented Sep 22, 2025 •

edited

Loading